This course is about Open Data, Open Science and Data Science. We learn to use the R programming language!
Here is a link to my repository
This week I have learned how to edit datasets, and how to represent the dataset graphically with R. I have also practiced evaluating the results from a linear regression model.
First I name my dataset learning2014 and read my dataset that will be used for analysis into R.
learning2014 <- read.table("/Users/Noora/Documents/IODS-project/data/learning2014.txt", header = TRUE)
The dataset contains information about students who have taken an exam for a specific course (Johdatus yhteiskuntatilastotieteeseen, an introductory social statistics course). In the dataset I have combined information about each participating student’s gender, age, attitude (global attitude toward statistics), deep learning score, strategic learning score, surface learning score and exam points. The scores for attitude and for deep, strategic and surface learning have been computed from each student’s responses to a questionnaire and are measured on a 5-level Likert scale.
I don’t show the full dataset here (it could be printed simply by typing learning2014 in R), but to give a reader unfamiliar with the data some idea of its contents, I show the first 6 observations. In the code below I first tell R to load the knitr package that I have installed. I use knitr to make the table easier to read (this is the knitr::kable part of the code), and head() to pick the first 6 observations.
library(knitr)
knitr::kable(head(learning2014))
| gender | age | attitude | deep | stra | surf | points |
|---|---|---|---|---|---|---|
| F | 53 | 3.7 | 3.583333 | 3.375 | 2.583333 | 25 |
| M | 55 | 3.1 | 2.916667 | 2.750 | 3.166667 | 12 |
| F | 49 | 2.5 | 3.500000 | 3.625 | 2.250000 | 24 |
| M | 53 | 3.5 | 3.500000 | 3.125 | 2.250000 | 10 |
| M | 49 | 3.7 | 3.666667 | 3.625 | 2.833333 | 22 |
| F | 38 | 3.8 | 4.750000 | 3.625 | 2.416667 | 21 |
I can explore the dimensions of my dataset to find out how many rows and columns it has.
dim(learning2014)
## [1] 166 7
This shows that the dataset has 166 rows (observations) and 7 columns (variables).
I can also explore the structure of the dataset.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
This shows again that the dataset has 166 observations and 7 variables, but the output also shows the names of the variables (e.g. gender, age), the type of data they contain (e.g. integers for age) and the first few values of each variable.
Next, I look at a graphical overview of the data. I use the packages GGally and ggplot2 here.
library(GGally)
library(ggplot2)
GraphicalOverview <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.5), title = "Graphical Overview", lower = list(combo = wrap("facethist", bins = 20)))
GraphicalOverview
I can also get summaries of all the variables in my dataset.
Summary of variable gender:
summary(learning2014$gender)
## F M
## 110 56
Summary of variable age:
summary(learning2014$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 21.00 22.00 25.51 27.00 55.00
Summary of variable attitude:
summary(learning2014$attitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 2.600 3.200 3.143 3.700 5.000
Summary of variable deep:
summary(learning2014$deep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 3.333 3.667 3.680 4.083 4.917
Summary of variable stra:
summary(learning2014$stra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 2.625 3.188 3.121 3.625 5.000
Summary of variable surf:
summary(learning2014$surf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 2.417 2.833 2.787 3.167 4.333
Summary of variable points:
summary(learning2014$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
The summaries give basic information about the variables. For all variables except gender, the output shows the minimum and maximum values of the variable in my dataset, as well as other key figures, i.e. the mean, median and 1st and 3rd quartile values. For example for variable points I see that all values are between 7 and 33, and 50% of the values are between 19 and 27.75.
For gender, the output shows frequencies: out of the students, 110 are female and 56 male. The first row of the graphical overview also gives this same information. The same row also gives information about the minimum, maximum, median and 1st and 3rd quartile values of each variable for the two genders (e.g. I can see the median age of males in the dataset). These are difficult to read accurately from the graphical overview picture, however.
The graphical overview also shows correlation values for all variable pairs, and the scatter plots can give some indication of the relationships between variables. For example, when looking at attitude, the highest point values correspond to high attitude values and vice versa. For age, however, both the highest and lowest point values seem to correspond to relatively low age values. Thus I can assume that there is a relationship between attitude and points but not between age and points.
Next I will create a regression model with three explanatory variables (age, gender and attitude) and exam points as the target variable. In other words, I am trying to investigate whether the variables age, gender and attitude can be used to predict the value of exam points. Earlier I saw that attitude might be related to points but age most likely isn’t, so let’s see what this regression model reveals!
I will conduct some tests on my model to see if the three variables can be used to predict the value of exam points or not. Let’s view the summary of the model:
Model <- lm(points ~ age + gender + attitude, data = learning2014)
summary(Model)
##
## Call:
## lm(formula = points ~ age + gender + attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
The result shows that with this model, it is possible to explain about 20% of the variance of points. The regression line crosses the y-axis at 13.4, and attitude is positively related to points (an increase in attitude increases predicted points) whereas the other variables are negatively related.
The results of this test indicate that the null hypothesis that the variable’s coefficient is 0 cannot be rejected for age and gender, as the p-values of both are greater than 0.05 (0.159 and 0.720 respectively). Based on this information I should remove these two variables. Attitude, however, seems to be statistically significant and can be left in the model. Of course there might be reasons why one of the other variables should be kept in the model and not removed just because of its p-value, but I don’t have any expertise in this topic that could help me justify keeping variables with high p-values.
Let’s conduct the same test but this time with just the variable attitude.
Model2 <- lm(points ~ attitude, data = learning2014)
summary(Model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
According to this test, attitude is statistically significant, so I keep it in our model!
Based on the output of the summary I see that the estimated effect of attitude on exam points is 3.5, and the variables are positively related. In other words, if attitude increases by 1, predicted exam points increase by about 3.5.
I can also see the value of parameter a (alpha), i.e. the point where the regression line crosses the y-axis. This value in my model is 11.6. It implies that a person with attitude 0 would get 11.6 points from the exam (although attitude is measured on a scale of 1 to 5, so this is an extrapolation outside the data). I can also see that the standard error of attitude is 0.6 and the p-value is very small (4.12 x 10^-9).
The multiple R-squared value tells how well the model explains the target variable, exam points. I see that around 19% of the variation in points is explained by attitude. I can say that attitude seems to contribute to success in the exam (exam points), but there are clearly other variables that affect exam success as well.
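As a side note, in a simple regression like this the multiple R-squared is just the squared correlation between the predictor and the response. Here is a small sketch of that identity, run on the built-in mtcars data (rather than learning2014) so the snippet is self-contained:

```r
# In simple linear regression, Multiple R-squared equals the squared
# correlation of predictor and response; illustrated on built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$r.squared        # R-squared reported by summary()
cor(mtcars$mpg, mtcars$wt)^2  # the same value as a squared correlation
```

The same check with points and attitude would reproduce the 0.1906 reported above.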
I can also draw diagnostic plots of my model. I want the diagnostic plots to show residuals vs. fitted values, a normal QQ-plot and residuals vs. leverage. These will be explained in the analysis of the diagnostic plots.
par(mfrow=c(2,2))
plot(Model2, which = c(1, 2, 5))
Because my model is about linear regression, it assumes that there is a linear relationship between attitude and exam points. Another important assumption is that there is a random variable (e) that adds noise to the observations. This e describes the errors of the model. The linear regression model has several assumptions related to the errors: it is assumed that the errors 1) are normally distributed, 2) aren’t correlated, and 3) have constant variance (i.e. the size of an error doesn’t depend on the explanatory variables, in this case attitude).
Analysing the residuals of the model allows me to investigate the validity of these assumptions.
The first plot, residuals vs. fitted, can be used to evaluate if the errors have constant variance. If there is any pattern in the scatter plot, there is a problem with the assumptions of the model. However, there doesn’t appear to be any pattern in my scatter plot, the spread of points seems random. Thus the plot doesn’t imply there would be problems with my model’s assumptions.
The second plot, normal QQ, can be used to evaluate the normality of the errors. The better the points fit to the line in the plot, the better they fit the normality assumption. From the QQ-plot I can see that most of the findings fit the normality assumption, except for in the beginning and end of the line, where there is more deviation from the line. Thus the normality assumption might be questionable.
The third plot, residuals vs. leverage, can be used to see how great an impact a single observation has on the model. There might be, for example, one observation that is causing the slope of the regression line to change significantly. In my plot, there are a couple of observations at the far right side. However, the leverage of these observations is around 0.04, which is extremely low. Thus no single observation seems to have a high impact on my model.
Based on the diagnostic plots, I can reasonably assume that my model’s assumptions are valid, meaning that the results of the tests above can also be trusted.
First I name my dataset student_alc and read it into R from my local folder.
student_alc <- read.table("/Users/Noora/Documents/IODS-project/data/create_alc.txt", header = TRUE)
The names of the variables in the dataset are:
colnames(student_alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset contains information about student alcohol use in Portugal. The original dataset that is available from here has been modified so that two separate datasets have been joined. This joined dataset has information about each student (e.g. age, gender, mother’s job, whether they have internet access or not), about their alcohol consumption (e.g. workday and weekend alcohol consumption), as well as about their school performance in mathematics and Portuguese. The variable names shown above list all 35 variables in the dataset. The dataset has 382 observations.
I will next study the relationships between high/low alcohol consumption and the following four variables: studytime, age, failures, and romantic. Here are my personal hypotheses about how each variable is related to alcohol consumption:

- Students with high alcohol consumption spend less time studying than those without high use.
- Students of different ages have similar alcohol consumption rates (no relationship).
- Students with high alcohol consumption have more past class failures.
- Students in a romantic relationship have higher alcohol consumption than single students.
I will next look at the distributions of the variables by printing out summaries of them:
summary(student_alc$studytime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.037 2.000 4.000
summary(student_alc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 16.00 17.00 16.59 17.00 22.00
summary(student_alc$failures)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2016 0.0000 3.0000
summary(student_alc$romantic)
## no yes
## 261 121
summary(student_alc$high_use)
## Mode FALSE TRUE NA's
## logical 268 114 0
Study time describes the student’s weekly study time on a scale of 1 to 4 (1 = <2h, 2 = 2-5h, 3 = 5-10h, 4 = >10h). The mean score for study time in the dataset is 2.037, so the average student studies between 2 and 5 hours per week. There are some who study less than 2h and some more than 10h (min is 1 and max is 4).
The ages in the dataset range from 15 to 22, with the average student being around 17. Most students have no past failures, with the median number of failures being 0 and the mean 0.2016. The maximum number of failures is 3. Most students are not in a romantic relationship (261 out of 382, i.e. almost 70%).
Additionally, I explored the number of students belonging to the high_use group. The majority of students (n = 268) don’t consume a lot of alcohol, but a large minority (n = 114) do.
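The percentages mentioned here can be double-checked directly from the summary counts; here is a quick sketch using prop.table on the counts themselves (no re-reading of the data needed):

```r
# Shares of the 382 students, computed from the summary counts above
high_use_counts <- c(no = 268, yes = 114)
romantic_counts <- c(no = 261, yes = 121)
round(prop.table(high_use_counts), 3)  # yes is about 0.298, i.e. ~30% high users
round(prop.table(romantic_counts), 3)  # no is about 0.683, i.e. almost 70% single
```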
Let’s next investigate some plots about the variables.
Study time
library(ggplot2)
ba1 <- ggplot(student_alc, aes(x = studytime, col = high_use))
ba1 + geom_bar(position = "dodge") + ggtitle("Study time")
bo1 <- ggplot(student_alc, aes(x = high_use, y = studytime, col = high_use))
bo1 + geom_boxplot() + ggtitle("Study time")
The bar plot shows that while roughly the same number of students without high use have study time 1 as have study time 3, among those with high use the number with study time 1 is much greater than with study time 3. Also, among students without high use there are over twice as many with study time 2 as with study time 1; for students with high use the difference is much smaller. Thus my hypothesis that students with high alcohol consumption spend less time studying than those without high use seems to be supported. The box plot also supports this view.
Age
ba2 <- ggplot(student_alc, aes(x = age, col = high_use))
ba2 + geom_bar(position = "dodge") + ggtitle("Age")
bo2 <- ggplot(student_alc, aes(x = high_use, y = age, col = high_use))
bo2 + geom_boxplot() + ggtitle("Age")
The bar and box plots suggest that there might be some differences in students’ alcohol consumption depending on their age. A relatively small percentage of 15-year-old students have high use of alcohol, but as the students get older, the percentage of students of the same age with high use seems to increase. Thus my hypothesis that students of different ages have similar alcohol consumption rates might be incorrect.
Failures
ba3 <- ggplot(student_alc, aes(x = failures, col = high_use))
ba3 + geom_bar(position = "dodge") + ggtitle("Failures")
bo3 <- ggplot(student_alc, aes(x = high_use, y = failures, col = high_use))
bo3 + geom_boxplot() + ggtitle("Failures")
From the bar plot I see that there are about the same number of students with and without high use who have 1, 2 or 3 failures. This is interesting because there are far more students without high use than with high use overall, so the percentage of students with failures must be higher among those with high use. This suggests that my hypothesis that students with high alcohol consumption have more failures is correct. However, it is also clear from both plots that the number of students with failures is in both groups quite small compared to the number of students without failures.
Romantic
ba4 <- ggplot(student_alc, aes(x = romantic, col = high_use))
ba4 + geom_bar(position = "dodge") + ggtitle("Romantic")
From the bar plot I see that most students don’t have a romantic relationship. Those without high use seem slightly more likely to have a relationship than those with high use, but the difference is quite small. Thus my hypothesis that students in a romantic relationship have higher alcohol consumption than single students is not supported by this plot.
Next I look at some cross-tabulations of the variables. I use the knitr library to present the results as nicer-looking tables (the knitr::kable part of my code does this).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
crosstab1 <- student_alc %>% group_by(high_use, romantic) %>% summarise(count = n())
crosstab2 <- student_alc %>% group_by(high_use) %>% summarise(mean_studytime = mean(studytime), mean_age = mean(age), mean_failures = mean(failures))
knitr::kable(crosstab1)
| high_use | romantic | count |
|---|---|---|
| FALSE | no | 180 |
| FALSE | yes | 88 |
| TRUE | no | 81 |
| TRUE | yes | 33 |
knitr::kable(crosstab2)
| high_use | mean_studytime | mean_age | mean_failures |
|---|---|---|---|
| FALSE | 2.149254 | 16.50000 | 0.1417910 |
| TRUE | 1.771930 | 16.78947 | 0.3421053 |
From the first cross-tabulation with the high_use and romantic variables, I see that 88 out of the 268 students without high use of alcohol have a relationship (i.e. 33%), and 33 out of the 114 with high use have a romantic relationship (i.e. 29%). This again shows that my hypothesis about romantic relationships and alcohol consumption is likely to be wrong.
From the second cross-tabulation I see that the average study time for those without high use is indeed greater than for those with high use (although the difference is not very large), supporting my earlier discussion about study time. The average age of students with and without high use seems quite similar, which might indicate that my hypothesis about age not being related to alcohol consumption might be actually true. For failures, the difference between the means is small, but it might still be possible my hypothesis is correct.
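The 33% and 29% figures can be computed from the first cross-tabulation with prop.table; here is a small sketch that rebuilds the table from the counts above:

```r
# Relationship shares within each high_use group, from the crosstab counts above
crosstab <- matrix(c(180, 81, 88, 33), nrow = 2,
                   dimnames = list(high_use = c("FALSE", "TRUE"),
                                   romantic = c("no", "yes")))
round(prop.table(crosstab, margin = 1), 2)  # row-wise proportions
```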
So far I have mixed results with my hypotheses and will conduct more analysis.
First I fit a logistic regression model with high use as the target variable and study time, age, failures and romantic as the predictors. I create a summary of the fitted model:
mymodel <- glm(high_use ~ studytime + age + failures + romantic, family = "binomial", data = student_alc)
summary(mymodel)
##
## Call:
## glm(formula = high_use ~ studytime + age + failures + romantic,
## family = "binomial", data = student_alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4680 -0.8575 -0.7179 1.2735 2.2796
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1149 1.6999 -1.832 0.066888 .
## studytime -0.5483 0.1583 -3.464 0.000531 ***
## age 0.2000 0.1027 1.948 0.051458 .
## failures 0.3080 0.1940 1.588 0.112373
## romanticyes -0.2132 0.2546 -0.838 0.402263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 439.96 on 377 degrees of freedom
## AIC: 449.96
##
## Number of Fisher Scoring iterations: 4
From the summary I see that study time is negatively related to high use (coefficient -0.55), and both age and failures are positively related (coefficients 0.20 and 0.31 respectively). Romantic relationships seem to be negatively related to high use (-0.21).
When looking at the p-values, study time is statistically significant (with a p-value much below 0.05), meaning that its coefficient is not 0 (it is related to high use). For age, the p-value is 0.051, so it is difficult to say whether it is statistically significant. Failures has a p-value of 0.11, and thus could probably be removed from the model (the null hypothesis of the test that failures’ coefficient is 0 seems valid). For romantic relationships, the p-value is 0.40, suggesting that it isn’t statistically significant and can be removed from the model. I will not remove any variables yet.
I will also look at the coefficients of my model (as odds ratios) and their confidence intervals.
odds_r <- coef(mymodel) %>% exp
conf_int <- confint(mymodel) %>% exp
## Waiting for profiling to be done...
cbind(odds_r, conf_int)
## odds_r 2.5 % 97.5 %
## (Intercept) 0.04438302 0.001526459 1.2137417
## studytime 0.57795482 0.419696413 0.7815616
## age 1.22142684 1.000405322 1.4977093
## failures 1.36071556 0.931252861 2.0033979
## romanticyes 0.80795889 0.486401760 1.3227895
From the odds ratios and their 95% confidence intervals I can see that two confidence intervals contain 1: those of failures and romantic. This means that the odds of high use with X (e.g. a romantic relationship) are the same as the odds of high use without it. This suggests that these two variables are not related to high use (their presence doesn’t change the odds of having high use). This again confirms my earlier suggestion that these two variables can be removed from the model.
Study time, on the other hand, doesn’t contain the value 1 in its confidence interval. Thus it seems that study time does have a relationship with high use and should be kept in the model. The odds ratio for study time is less than 1, suggesting that study time is negatively associated with high use, as also found earlier.
Age, again, is more complex to interpret. The lower bound of its confidence interval comes very close to 1 (1.0004), so it is again difficult to say whether it clearly has a relationship with high use. It seems that age is positively related to high use, as also mentioned before, but the effect is fairly weak (the odds ratio of 1.22 is close to 1). I will keep age in my model for now, as the evidence against keeping it is not very strong.
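The near-1 lower bound can be sanity-checked by hand. A rough Wald-type interval built from the coefficient table (estimate 0.2000, standard error 0.1027) lands essentially at 1; note that confint() profiles the likelihood instead, which is why its lower bound (1.0004) differs slightly from this approximation:

```r
# Approximate (Wald) 95% CI for age on the odds-ratio scale, from the
# summary table above; confint() uses profiling, so its bounds differ a little
est <- 0.2000
se  <- 0.1027
exp(est)                        # odds ratio, about 1.22
exp(est + c(-1.96, 1.96) * se)  # approximate CI, lower bound essentially at 1
```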
When comparing these results to my hypotheses, it seems that only my hypothesis about study time is supported by the model (study time is negatively related to high use). My hypothesis about age (no relationship) might or might not be correct - it seems there is a weak relationship. My hypotheses about failures and romantic seem both incorrect as my model suggests no relationship between either of these variables and high use.
So, now I change my model to only include study time and age as the predictors.
mymodel2 <- glm(high_use ~ studytime + age, family = "binomial", data = student_alc)
I will look at my model’s predictions and the actual values of my data next.
probabilities <- predict(mymodel2, type = "response")
student_alc <- mutate(student_alc, probability = probabilities)
student_alc <- mutate(student_alc, prediction = probabilities > 0.5)
table(high_use = student_alc$high_use, prediction = student_alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 259 9
## TRUE 102 12
Here is a graphic visualising the actual values and the predictions:
g <- ggplot(student_alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()
The total proportion of inaccurately classified individuals is calculated with a loss function:
loss_funct <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_funct(class = student_alc$high_use, prob = student_alc$probability)
## [1] 0.2905759
It seems that the proportion of inaccurately classified individuals is 0.29 (29%). The smaller this number is, the better the model is at predictions, so my model seems okay, but there is room for improvement as nearly a third of individuals are misclassified. My model mainly misclassifies those with high use as not having high use: out of the 114 with high use, my model classifies only 12 as having high use. However, out of the 268 who don’t have high use, my model correctly classifies 259 individuals.
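The per-class accuracies discussed here can be read off the confusion matrix directly; here is a small sketch using the counts from the table above:

```r
# Error rate and per-class accuracy from the 2x2 table above
conf <- matrix(c(259, 102, 9, 12), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
(9 + 102) / sum(conf)  # total training error, matches the loss function (~0.29)
12 / (102 + 12)        # share of actual high users detected, only about 0.11
259 / (259 + 9)        # share of non-high users classified correctly, about 0.97
```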
The performance of my model should still be better than the performance of a simple guessing strategy. For example, let’s say I guess that all individuals with little time spent studying (study time 1 or 2) have high use of alcohol, and all those with much time spent studying (study time 3 or 4) don’t have high use. I use the xtabs function to get a table showing the number of students with or without high use for the different study times. I again use the knitr library to present the table in a nicer format.
guess <- xtabs(~ high_use + studytime, student_alc)
knitr::kable(guess)
|  | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| FALSE | 58 | 135 | 52 | 23 |
| TRUE | 42 | 60 | 8 | 4 |
I see that there are 58 + 135 = 193 students whose study time is 1 or 2 but who don’t have high use of alcohol. There are also 8 + 4 = 12 students with study time 3 or 4 who do have high use of alcohol. Thus my guess would have been incorrect in 193 + 12 = 205 out of 382 cases, i.e. in 54% of cases. Thus my model is far more accurate than my simple guessing strategy.
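The 54% figure comes from simple arithmetic on the xtabs table above:

```r
# Error rate of the simple guessing strategy: guess high use for study time
# 1-2 and no high use for study time 3-4 (counts from the xtabs table above)
wrong <- (58 + 135) + (8 + 4)  # misclassified non-high users + missed high users
wrong          # 205 incorrect guesses
wrong / 382    # about 0.54, clearly worse than the model's 0.29
```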
I load the Boston data and explore its structure and dimensions.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset has 506 observations and 14 variables. The dataset relates to housing values in Boston’s suburbs. The variables include for example crim (per capita crime rate by town), dis (weighted mean of distances to five Boston employment centres) and ptratio (pupil-teacher ratio by town).
Here is a graphical overview of the data.
library(GGally)
library(ggplot2)
GraphicalOverview <- ggpairs(Boston, mapping = aes(), title = "Graphical Overview", lower = list(combo = wrap("facethist", bins = 20)))
GraphicalOverview
Here is also a correlation plot of the variables:
library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "r", tl.pos = "d")
Here are summaries of the variables. The package knitr is used to present the results in a nicer table format (the knitr::kable part of the code).
library(knitr)
knitr::kable(summary(Boston))
|  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Min. : 0.00632 | Min. : 0.00 | Min. : 0.46 | Min. :0.00000 | Min. :0.3850 | Min. :3.561 | Min. : 2.90 | Min. : 1.130 | Min. : 1.000 | Min. :187.0 | Min. :12.60 | Min. : 0.32 | Min. : 1.73 | Min. : 5.00 |
|  | 1st Qu.: 0.08204 | 1st Qu.: 0.00 | 1st Qu.: 5.19 | 1st Qu.:0.00000 | 1st Qu.:0.4490 | 1st Qu.:5.886 | 1st Qu.: 45.02 | 1st Qu.: 2.100 | 1st Qu.: 4.000 | 1st Qu.:279.0 | 1st Qu.:17.40 | 1st Qu.:375.38 | 1st Qu.: 6.95 | 1st Qu.:17.02 |
|  | Median : 0.25651 | Median : 0.00 | Median : 9.69 | Median :0.00000 | Median :0.5380 | Median :6.208 | Median : 77.50 | Median : 3.207 | Median : 5.000 | Median :330.0 | Median :19.05 | Median :391.44 | Median :11.36 | Median :21.20 |
|  | Mean : 3.61352 | Mean : 11.36 | Mean :11.14 | Mean :0.06917 | Mean :0.5547 | Mean :6.285 | Mean : 68.57 | Mean : 3.795 | Mean : 9.549 | Mean :408.2 | Mean :18.46 | Mean :356.67 | Mean :12.65 | Mean :22.53 |
|  | 3rd Qu.: 3.67708 | 3rd Qu.: 12.50 | 3rd Qu.:18.10 | 3rd Qu.:0.00000 | 3rd Qu.:0.6240 | 3rd Qu.:6.623 | 3rd Qu.: 94.08 | 3rd Qu.: 5.188 | 3rd Qu.:24.000 | 3rd Qu.:666.0 | 3rd Qu.:20.20 | 3rd Qu.:396.23 | 3rd Qu.:16.95 | 3rd Qu.:25.00 |
|  | Max. :88.97620 | Max. :100.00 | Max. :27.74 | Max. :1.00000 | Max. :0.8710 | Max. :8.780 | Max. :100.00 | Max. :12.127 | Max. :24.000 | Max. :711.0 | Max. :22.00 | Max. :396.90 | Max. :37.97 | Max. :50.00 |
The variables are quite interesting. For example, per capita crime rate by town (crim) seems to be quite low in most cases (the 3rd quartile is 3.68), but the maximum value is 88.98, suggesting that some areas are highly different from the more typical areas when it comes to crime rates. For the pupil-teacher ratio by town (ptratio), the 1st quartile value is 17.40 and the 3rd quartile 20.20, meaning that 50% of areas have between 17.4 and 20.2 pupils per teacher. However, even with this variable, the difference between the min (12.60) and max (22.00) is significant, with some areas having on average about 10 pupils more per teacher than others. Also the proportion of owner-occupied units built prior to 1940 (age) has a large range (min 2.90, max 100.00). There is also a ten-fold difference between the minimum (5.00) and maximum (50.00) median home values (medv). The average number of rooms per dwelling (rm) seems to be the only variable that is somewhat normally distributed.
The scatter plots in the graphical overview can give some indication about the relationships between the variables. For example, when looking at home values (medv), the highest home values seem to be in the areas with the lowest lstat (lower status of the population %) values, whereas the high lstat areas have only low home values. The correlation plot also confirms that there is strong (negative) correlation between home values and lstat. The correlation plot also suggests that there is a correlation between the number of rooms per dwelling (rm) and home values. The same is indicated by the scatter plots - the more rooms the dwelling has, the higher the home value, and vice versa. On the other hand, there is no clear relationship between for example proportion of residential land zoned for lots over 25,000 sq.ft (zn) and home value, as high home values seem to correspond to both high and low values of zn. The correlation plot also shows no strong correlation between these two variables.
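The correlations read off the plot can also be confirmed numerically with cor(); a short sketch on the same Boston data:

```r
# Numeric check of two correlations discussed above
library(MASS)  # provides the Boston data
cor(Boston$medv, Boston$lstat)  # strongly negative: higher lstat, lower value
cor(Boston$medv, Boston$rm)     # clearly positive: more rooms, higher value
```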
Next I will scale the dataset and look at a summary of the scaled data (using again knitr to present the summary as a nice table). I also convert the scaled dataset (sBoston) to a data frame format, which will be needed later on.
sBoston <- scale(Boston)
knitr::kable(summary(sBoston))
|  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Min. :-0.419367 | Min. :-0.48724 | Min. :-1.5563 | Min. :-0.2723 | Min. :-1.4644 | Min. :-3.8764 | Min. :-2.3331 | Min. :-1.2658 | Min. :-0.9819 | Min. :-1.3127 | Min. :-2.7047 | Min. :-3.9033 | Min. :-1.5296 | Min. :-1.9063 |
|  | 1st Qu.:-0.410563 | 1st Qu.:-0.48724 | 1st Qu.:-0.8668 | 1st Qu.:-0.2723 | 1st Qu.:-0.9121 | 1st Qu.:-0.5681 | 1st Qu.:-0.8366 | 1st Qu.:-0.8049 | 1st Qu.:-0.6373 | 1st Qu.:-0.7668 | 1st Qu.:-0.4876 | 1st Qu.: 0.2049 | 1st Qu.:-0.7986 | 1st Qu.:-0.5989 |
|  | Median :-0.390280 | Median :-0.48724 | Median :-0.2109 | Median :-0.2723 | Median :-0.1441 | Median :-0.1084 | Median : 0.3171 | Median :-0.2790 | Median :-0.5225 | Median :-0.4642 | Median : 0.2746 | Median : 0.3808 | Median :-0.1811 | Median :-0.1449 |
|  | Mean : 0.000000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 |
|  | 3rd Qu.: 0.007389 | 3rd Qu.: 0.04872 | 3rd Qu.: 1.0150 | 3rd Qu.:-0.2723 | 3rd Qu.: 0.5981 | 3rd Qu.: 0.4823 | 3rd Qu.: 0.9059 | 3rd Qu.: 0.6617 | 3rd Qu.: 1.6596 | 3rd Qu.: 1.5294 | 3rd Qu.: 0.8058 | 3rd Qu.: 0.4332 | 3rd Qu.: 0.6024 | 3rd Qu.: 0.2683 |
|  | Max. : 9.924110 | Max. : 3.80047 | Max. : 2.4202 | Max. : 3.6648 | Max. : 2.7296 | Max. : 3.5515 | Max. : 1.1164 | Max. : 3.9566 | Max. : 1.6596 | Max. : 1.7964 | Max. : 1.6372 | Max. : 0.4406 | Max. : 3.5453 | Max. : 2.9865 |
sBoston <- as.data.frame(sBoston)
The values of the variables changed so that the mean of every variable is now 0, and the other values have been scaled with the formula (x - mean(x)) / sd(x). This means that the min and 1st quartile values are below 0 and the 3rd quartile and max values above 0. The scales of the variables are also now much closer to each other.
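As a quick sanity check of the formula above, the scaled crim column can be compared to a manually standardised version (a minimal sketch, assuming the MASS Boston data is loaded):

```r
library(MASS)
data("Boston")

# scale() should apply (x - mean(x)) / sd(x) column by column
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scale(Boston)[, "crim"]), manual)  # TRUE
```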
Next I create a categorical variable ‘crime’ from the (scaled) crime rate, using the quantiles as break points. The ‘crime’ variable will have 4 levels: low, med_low, med_high and high. I also drop the old crime rate variable from my scaled Boston dataset (sBoston) and add the new categorical crime rate to it.
scaled_crim <- sBoston$crim
bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
sBoston <- dplyr::select(sBoston, -crim)
sBoston <- data.frame(sBoston, crime)
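Because the break points are the quantiles of the scaled crime rate, the four classes should end up roughly equal in size; this can be checked with a simple frequency table:

```r
# counts per crime category (should each be close to 506 / 4)
table(crime)
```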
Next I will divide the scaled dataset into train and test sets, with 80% of the data in the train set.
n <- nrow(sBoston)
ind <- sample(n, size = n*0.8)
train <- sBoston[ind,]
test <- sBoston[-ind,]
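The split above is random, so the results change on every knit. If a reproducible split is wanted instead, a seed can be fixed before sampling; the rest of the logic is the same (the seed value here is an arbitrary choice):

```r
set.seed(2017)                     # any fixed seed makes the split repeatable
n   <- nrow(sBoston)
ind <- sample(n, size = n * 0.8)   # indices of the 80% train rows
train <- sBoston[ind, ]
test  <- sBoston[-ind, ]
```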
Next I will fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables as predictors. I will also draw the LDA (bi)plot, but without showing the LDA arrows.
lda.fit <- lda(crime ~ ., data = train)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
I will next save the correct classes (categories) for crime rate in the test set and then remove the categorical crime rate variable from the test set for testing purposes. After this I will predict the crime classes with my LDA model on the test data and look at whether the predictions were correct.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 8 1 0
## med_low 6 13 10 0
## med_high 0 6 18 0
## high 0 0 0 27
My LDA predictions were correct about 70–75% of the time (depending on which observations end up in the train and test sets, as these change every time I knit the document). The predictions for the high class seem to be mostly correct, but the model has some problems predicting the other classes correctly.
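Instead of reading the accuracy off the cross-tabulation, it can also be computed directly from the objects already created above:

```r
# share of test-set observations whose predicted class matches the true class
mean(lda.pred$class == correct_classes)
```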
Next I will reload the Boston dataset and scale it again (calling it sBoston2 now). I will also calculate the distances between the observations using the Euclidean distance measure.
data("Boston")
sBoston2 <- scale(Boston)
sBoston2 <- as.data.frame(sBoston2)
dist_eu <- dist(sBoston2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
Next I will use the K-means clustering method on the dataset. I don’t yet know the optimal number of clusters so I use 5 for this run of the algorithm.
km <- kmeans(dist_eu, centers = 5)
pairs(sBoston2, col = km$cluster, lower.panel = NULL)
However, I want to determine the optimal number of clusters, so I calculate the total within-cluster sum of squares (WCSS) for 1 to 10 clusters and plot it.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type = 'b')
From this I see that the optimal number of clusters seems to be 2, as this is where the largest drop in the WCSS occurs. Now I run the k-means clustering again with two centers and visualise the results with a plot.
km <- kmeans(dist_eu, centers = 2)
pairs(sBoston2, col = km$cluster, lower.panel = NULL)
In this visualisation all the data points are assigned to two different clusters (the red and black clusters). It seems that some variables are more relevant for clustering than others. For example points with low rad (index of accessibility to radial highways) values in general belong to the red cluster, and points with high rad values belong to the black cluster. Thus rad seems useful for clustering purposes. Another relevant variable is e.g. tax (full-value property-tax rate). However, for example the values of the variable chas don’t seem to give any indication about which cluster the points belong to, and thus chas is not very relevant for clustering. Thus this visualisation is helpful in showing whether the variables or variable pairs are good indicators about which cluster a specific data point belongs to.
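The sizes of the two clusters are stored in the kmeans result and give a quick sense of how the observations split:

```r
# number of observations assigned to each cluster
km$size
# the same information as a frequency table of cluster labels
table(km$cluster)
```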
I will start off by exploring the structure and dimensions of my ‘human’ dataset.
human <- read.table("/Users/Noora/Documents/IODS-project/data/create_human_.txt", header = TRUE)
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ High_educ_f_m: num 1.007 0.997 0.983 0.989 0.969 ...
## $ F_m_working : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Educ_exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life_exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI_ : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat_deaths : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Young_births : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parl_seats : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
From the outputs I see that the dataset has 155 observations and 8 variables. However, I know that the row names denote the variable ‘Country’, so in reality the dataset has 9 variables, not 8. For example, the last row of my dataset is:
library(knitr)
knitr::kable(tail(human, n = 1))
| High_educ_f_m | F_m_working | Educ_exp | Life_exp | GNI_ | Mat_deaths | Young_births | Parl_seats | |
|---|---|---|---|---|---|---|---|---|
| Niger | 0.3076923 | 0.4459309 | 5.4 | 61.4 | 908 | 630 | 204.8 | 13.3 |
Here I have 8 named columns, but the first column shows the name of the country in question (in this case Niger), so in effect there are 9 variables.
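If ‘Country’ is ever needed as an explicit column (for example for joins or plot labels), the row names can be turned into a variable. A minimal sketch, assuming the tibble package is installed (the name human_with_country is mine):

```r
library(tibble)

# move the row names into a proper 'Country' column
human_with_country <- rownames_to_column(human, var = "Country")
head(human_with_country$Country)
```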
The original ‘human’ dataset includes variables that relate to the Human Development Index (HDI) and to other indexes such as the Gender Development Index (GDI), combining indicators from most countries. The dataset originally consisted of more variables than the 9 shown in my dataset, but it has been wrangled down to 9 variables, to only those observations that relate to countries, and to only observations without missing values. The variable ‘Country’ is the name of the country in question. The other 8 variables in my dataset relate to two topics: Health and knowledge, and Empowerment. The variables related to Empowerment in my dataset are:
The variables related to Health and knowledge in my dataset are:
The source for the definitions of the variables is this website.
Here is a graphical overview of the data:
library(ggplot2)
library(GGally)
GraphOV <- ggpairs(human, title = "Graphical Overview", lower = list(combo = wrap("facethist", bins = 20)))
GraphOV
Here are also summaries of the variables in my dataset (the knitr part of the code just formats the result into a nice table):
knitr::kable(summary(human))
| High_educ_f_m | F_m_working | Educ_exp | Life_exp | GNI_ | Mat_deaths | Young_births | Parl_seats | |
|---|---|---|---|---|---|---|---|---|
| Min. :0.1717 | Min. :0.1857 | Min. : 5.40 | Min. :49.00 | Min. : 581 | Min. : 1.0 | Min. : 0.60 | Min. : 0.00 | |
| 1st Qu.:0.7264 | 1st Qu.:0.5984 | 1st Qu.:11.25 | 1st Qu.:66.30 | 1st Qu.: 4198 | 1st Qu.: 11.5 | 1st Qu.: 12.65 | 1st Qu.:12.40 | |
| Median :0.9375 | Median :0.7535 | Median :13.50 | Median :74.20 | Median : 12040 | Median : 49.0 | Median : 33.60 | Median :19.30 | |
| Mean :0.8529 | Mean :0.7074 | Mean :13.18 | Mean :71.65 | Mean : 17628 | Mean : 149.1 | Mean : 47.16 | Mean :20.91 | |
| 3rd Qu.:0.9968 | 3rd Qu.:0.8535 | 3rd Qu.:15.20 | 3rd Qu.:77.25 | 3rd Qu.: 24512 | 3rd Qu.: 190.0 | 3rd Qu.: 71.95 | 3rd Qu.:27.95 | |
| Max. :1.4967 | Max. :1.0380 | Max. :20.20 | Max. :83.50 | Max. :123124 | Max. :1100.0 | Max. :204.80 | Max. :57.50 |
From the summary table I see immediately that many of the variables have fairly large ranges. For example, some countries have no women at all in parliament (the min value of Parl_seats is 0), whereas some have more women in parliament than men (the max value is above 50%). This already suggests that some countries in the dataset are quite different from each other. Similarly, there seem to be large differences between countries on all the other variables: for example, in some countries the expected years of schooling are under 6 and in others above 20. At the same time, expected years of schooling is among the variables where many countries are fairly similar, with 50% of countries falling between roughly 11 and 15 expected years. Judging by the graphical overview, this variable also seems to be the closest of all the variables to being normally distributed.
The graphical overview indicates that some of the variables may be related to one another. For example, it seems that as the adolescent birth rate increases, so does the maternal mortality ratio. Not surprisingly, there also seems to be a relationship between the maternal mortality ratio and life expectancy, with higher maternal mortality corresponding to lower life expectancy. In fact, a great number of the variables seem to be correlated. Another example is expected years of schooling and life expectancy: the higher the education expectancy, the higher the life expectancy, and vice versa. Among the variable pairs that do not seem strongly correlated are e.g. expected years of schooling and the percentage of female representatives in parliament.
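The pairwise correlations behind these impressions can also be printed as a rounded matrix (or visualised further with e.g. corrplot::corrplot):

```r
# correlation matrix of the human variables, rounded to two decimals
round(cor(human), digits = 2)
```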
Next I will perform principal component analysis (PCA) without standardising the dataset. First I will look at the variability captured by the principal components. I will also draw a biplot displaying the observations with PC1 in coordinate x-axis and PC2 in y-axis.
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
biplot(pca_human, choices = 1:2, cex = c(1, 1), col = c("grey40", "deeppink2"), sub = "Plot 1: PC1 & PC2 with non standardised dataset")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
However, the variables should first be standardised for the analysis to be meaningful, so I repeat the PCA and look at the variability and biplot of the standardised dataset next.
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
biplot(pca_human_std, choices = 1:2, cex = c(1, 1), col = c("grey40", "deeppink2"), sub = "Plot 2: PC1 & PC2 with standardised dataset")
When comparing the two analyses, the results are very different. Without standardisation, PC1 alone captures practically all (99.99%) of the variance. This happens because PCA is sensitive to the scales of the variables: gross national income per capita has by far the largest variance, so the first principal component is essentially just GNI, and national income completely dominates the picture. This is also why the biplot of the non-standardised data (Plot 1) is uninformative and R warns about zero-length arrows: the other variables contribute almost nothing to the displayed components. After standardisation every variable has variance 1 and contributes on an equal footing, and the variance is spread much more evenly: PC1 captures about 53.6% and PC2 about 16.2% of the variation.
Based on the biplot drawn after PCA on the standardised data (Plot 2), my interpretation of the first two principal component dimensions is that PC1 mostly separates countries by health and knowledge: life expectancy, expected years of schooling and national income pull in one direction along it, while maternal mortality and adolescent births pull in the opposite direction. PC2 relates more to empowerment and working life: the share of women in parliament and the female-to-male labour force participation ratio load mainly on this second dimension.
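The percentages of variance shown in the summary can also be computed directly from the standard deviations of the PCA object, which is handy e.g. for axis labels:

```r
# % of total variance explained by each principal component
pve <- round(100 * pca_human_std$sdev^2 / sum(pca_human_std$sdev^2), digits = 1)
pve  # the first element is the share explained by PC1
```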
Next I move to another dataset - the tea dataset from the FactoMineR package.
library(FactoMineR)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The dataset seems to have 36 variables and 300 observations. The variables (e.g. age, sex, lunch and dinner) relate to tea consumption. I only want to keep some interesting variables in my analysis, so I choose to focus on the variables frequency, Tea, sugar, tea.time and sex.
I will next visualise these variables from the dataset.
library(tidyr)
library(dplyr)
keep_columns <- c("frequency", "Tea", "sugar", "tea.time", "sex")
te1 <- dplyr::select(tea, one_of(keep_columns))
gather(te1) %>% ggplot(aes(value)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + facet_wrap("key", scales = "free")
## Warning: attributes are not identical across measure variables; they will
## be dropped
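The gather() function used above is nowadays superseded in tidyr. A sketch of the same reshape-and-plot written with pivot_longer (assuming tidyr >= 1.0), converting the factors to character first so that the "attributes are not identical" warning is avoided:

```r
library(tidyr); library(dplyr); library(ggplot2)

te1 %>%
  mutate(across(everything(), as.character)) %>%                 # factors -> character
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) + geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
  facet_wrap("key", scales = "free")
```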
Next I will perform MCA on the variables of the tea dataset that I kept for analysis.
mca_tea <- MCA(te1, graph = FALSE)
summary(mca_tea)
##
## Call:
## MCA(X = te1, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.316 0.237 0.217 0.201 0.189 0.177
## % of var. 19.737 14.823 13.555 12.562 11.836 11.040
## Cumulative % of var. 19.737 34.559 48.114 60.676 72.512 83.552
## Dim.7 Dim.8
## Variance 0.146 0.117
## % of var. 9.112 7.336
## Cumulative % of var. 92.664 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.739 0.576 0.302 | 0.514 0.372 0.146 | 0.162 0.040
## 2 | -0.171 0.031 0.018 | 0.628 0.555 0.243 | -0.186 0.053
## 3 | -0.785 0.650 0.714 | -0.250 0.088 0.073 | -0.129 0.026
## 4 | 0.929 0.911 0.661 | -0.179 0.045 0.024 | -0.287 0.127
## 5 | 0.152 0.024 0.021 | 0.277 0.108 0.068 | -0.039 0.002
## 6 | 0.507 0.272 0.201 | 0.227 0.073 0.040 | -0.443 0.302
## 7 | -0.028 0.001 0.000 | 0.041 0.002 0.001 | 0.358 0.197
## 8 | -0.486 0.249 0.105 | 0.315 0.139 0.044 | 0.799 0.982
## 9 | -0.298 0.094 0.087 | 0.042 0.002 0.002 | 0.062 0.006
## 10 | -0.038 0.002 0.001 | 0.970 1.323 0.581 | 0.410 0.258
## cos2
## 1 0.014 |
## 2 0.021 |
## 3 0.019 |
## 4 0.063 |
## 5 0.001 |
## 6 0.153 |
## 7 0.055 |
## 8 0.283 |
## 9 0.004 |
## 10 0.104 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## 1/day | 0.394 3.108 0.072 4.634 | -0.037 0.036 0.001
## 1 to 2/week | 0.772 5.543 0.103 5.538 | -0.227 0.636 0.009
## +2/day | -0.604 9.772 0.268 -8.944 | 0.084 0.254 0.005
## 3 to 6/week | 0.155 0.173 0.003 0.960 | 0.082 0.064 0.001
## black | -0.456 3.243 0.068 -4.508 | 1.055 23.164 0.365
## Earl Grey | 0.080 0.261 0.012 1.860 | -0.632 21.698 0.721
## green | 0.553 2.132 0.038 3.363 | 1.332 16.467 0.219
## No.sugar | -0.573 10.743 0.351 -10.243 | 0.478 9.946 0.244
## sugar | 0.612 11.483 0.351 10.243 | -0.511 10.632 0.244
## Not.tea time | 0.712 14.036 0.393 10.846 | 0.323 3.836 0.081
## v.test Dim.3 ctr cos2 v.test
## 1/day -0.435 | -0.885 22.865 0.363 -10.416 |
## 1 to 2/week -1.625 | 1.173 18.618 0.237 8.411 |
## +2/day 1.249 | 0.056 0.122 0.002 0.829 |
## 3 to 6/week 0.505 | 0.745 5.803 0.071 4.607 |
## black 10.441 | 0.942 20.177 0.290 9.319 |
## Earl Grey -14.687 | -0.103 0.632 0.019 -2.397 |
## green 8.099 | -1.508 23.079 0.281 -9.169 |
## No.sugar 8.542 | -0.176 1.468 0.033 -3.138 |
## sugar -8.542 | 0.188 1.569 0.033 3.138 |
## Not.tea time 4.913 | -0.132 0.706 0.014 -2.015 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## frequency | 0.294 0.012 0.514 |
## Tea | 0.089 0.727 0.476 |
## sugar | 0.351 0.244 0.033 |
## tea.time | 0.393 0.081 0.014 |
## sex | 0.452 0.122 0.048 |
The first two dimensions of the MCA capture only about 34.6% of the variance (19.7% and 14.8%), so the biplot below should be read as a rough summary. In the variable biplot, categories plotted close to each other are associated with each other. The eta2 values in the output above already hint at the structure: sex, tea time and sugar use are the variables most strongly tied to the first dimension, while the tea type (black, Earl Grey, green) dominates the second. Next I draw the variable biplot of the analysis.
plot(mca_tea, invisible = c("ind"), habillage = "quali")
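An alternative, ggplot2-based view of the same MCA can be drawn with the factoextra package (an assumption here: factoextra must be installed); it also offers repelled labels so that overlapping category names stay readable:

```r
library(factoextra)

# variable-category map of the MCA with non-overlapping labels
fviz_mca_var(mca_tea, repel = TRUE)
```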